Parallelizing Word2Vec in Multi-Core and Many-Core Architectures

Authors

  • Shihao Ji
  • Nadathur Satish
  • Sheng Li
  • Pradeep Dubey
Abstract

Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms, including those by Mikolov et al. [5, 6], have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not use computational resources efficiently. In this paper, we propose "HogBatch", which improves the reuse of various data structures in the algorithm through minibatching and negative sample sharing, allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute the word2vec computation across the nodes of a compute cluster, and demonstrate good strong scaling up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale the computation near-linearly across cores and nodes and to process hundreds of millions of words per second, which is, to the best of our knowledge, the fastest word2vec implementation.

1  From Hogwild to HogBatch

We refer the reader to [5, 6] for an introduction to word2vec and its optimization problem. The original implementation of word2vec by Mikolov et al.¹ uses Hogwild [7] to parallelize SGD. Hogwild is a parallel SGD algorithm that seeks to ignore conflicts between model updates on different threads and allows updates to proceed even in the presence of conflicts. The pseudocode of word2vec Hogwild SGD is shown in Algorithm 1. The algorithm takes in a matrix M_in of size V×D that contains the word representation of each input word, and a matrix M_out of size V×D for the word representation of each output word. Each word is represented as an array of D floating-point numbers, corresponding to one row of the two matrices. These matrices are updated during training.
We take in a target word and a set of N input context words around the target, as depicted at the top of Figure 1. The algorithm iterates over the N input words in Lines 2–3. In the loop at Line 6, we pick either the positive example (the target word, Line 8) or a negative example at random (Line 10). Lines 13–15 compute the gradient of the objective function with respect to the choice of input word and positive/negative example. Lines 17–20 perform the updates to the entries M_out[pos/neg example] and M_in[input context]. The pseudocode shows only a single thread; in Hogwild, the loop in Line 2 is parallelized over threads without any additional change to the code.

Algorithm 1 reads and updates entries corresponding to the input context and positive/negative words at each iteration of the loop at Line 6. This means that there is a potential dependence between successive iterations: they may happen to touch the same word representations, and each iteration must potentially wait for the update from the previous iteration to complete. Hogwild ignores such dependencies and proceeds with updates regardless of conflicts. In theory, this can reduce the rate of convergence of the algorithm compared to a sequential run. However, the Hogwild approach has been shown to work well when the updates across threads are unlikely to touch the same word; and indeed, for large vocabulary sizes, conflicts are relatively rare and convergence is not typically affected.

¹ https://code.google.com/archive/p/word2vec/

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
arXiv:1611.06172v2 [cs.DC] 23 Dec 2016

Algorithm 1 word2vec Hogwild SGD in one thread.
 1: Given model parameters Ω = {M_in, M_out}, learning rate α, 1 target word w_out, and N input words {w_in^0, w_in^1, ..., w_in^(N−1)}
 2: for (i = 0; i < N; i++) {
 3:   input_word = w_in^i;
 4:   for (j = 0; j < D; j++) temp[j] = 0;
 5:   // negative sampling
 6:   for (k = 0; k < negative + 1; k++) {
 7:     if (k == 0) {
 8:       target_word = w_out; label = 1;
 9:     } else {
10:       target_word = sample one word from V; label = 0;
11:     }
12:     inn = 0;
13:     for (j = 0; j < D; j++) inn += M_in[input_word][j] * M_out[target_word][j];
14:     err = label − σ(inn);
15:     for (j = 0; j < D; j++) temp[j] += err * M_out[target_word][j];
16:     // update output matrix
17:     for (j = 0; j < D; j++) M_out[target_word][j] += α * err * M_in[input_word][j];
18:   }
19:   // update input matrix
20:   for (j = 0; j < D; j++) M_in[input_word][j] += α * temp[j];
21: }

[Figure 1: target word w_out^t, input context words w_in^i within the context window, and negative samples w_out^k drawn from V.]
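To make the contrast concrete, here is a minimal NumPy sketch (not the authors' code; the function names `hogwild_step` and `hogbatch_step` are illustrative) of one outer-loop iteration of Algorithm 1, next to a HogBatch-style reformulation in which a minibatch of input words shares the negative samples, so that all dot products collapse into a single matrix multiply:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hogwild_step(M_in, M_out, input_word, w_out, neg_samples, alpha):
    # One iteration of the outer loop of Algorithm 1 (Lines 3-20):
    # scalar dot products, with updates applied one target at a time.
    temp = np.zeros(M_in.shape[1])
    targets = [w_out] + list(neg_samples)   # k = 0 is the positive example
    labels = [1.0] + [0.0] * len(neg_samples)
    for target_word, label in zip(targets, labels):
        inn = M_in[input_word] @ M_out[target_word]           # Line 13
        err = label - sigmoid(inn)                            # Line 14
        temp += err * M_out[target_word]                      # Line 15
        M_out[target_word] += alpha * err * M_in[input_word]  # Line 17
    M_in[input_word] += alpha * temp                          # Line 20

def hogbatch_step(M_in, M_out, input_words, w_out, neg_samples, alpha):
    # HogBatch-style sketch: a minibatch of B input words shares one positive
    # target and one set of negative samples, so the B*(negative+1) dot
    # products become a single (B x D) @ (D x K) matrix multiply.
    input_words = np.asarray(input_words)
    targets = np.array([w_out] + list(neg_samples))
    labels = np.array([1.0] + [0.0] * len(neg_samples))
    A = M_in[input_words]               # (B, D) minibatch of input rows (a copy)
    Bmat = M_out[targets]               # (K, D) shared positive + negatives (a copy)
    err = labels - sigmoid(A @ Bmat.T)  # (B, K) all gradient coefficients at once
    # np.add.at accumulates correctly even when indices repeat in the batch
    np.add.at(M_in, input_words, alpha * (err @ Bmat))  # batched Line 20
    np.add.at(M_out, targets, alpha * (err.T @ A))      # batched Line 17
```

Note the design trade-off: the batched step computes every dot product against the model values read at the start of the step, whereas Algorithm 1 applies each update immediately; for a batch of one input word with distinct targets the two coincide, but for larger batches HogBatch trades a slightly stale read for the ability to use level-3 BLAS.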


Related Articles

Parallelizing Word2Vec in Shared and Distributed Memory

Word2Vec is a widely used algorithm for extracting low-dimensional vector representations of words. It generated considerable excitement in the machine learning and natural language processing (NLP) communities recently due to its exceptional performance in many NLP applications such as named entity recognition, sentiment analysis, machine translation and question answering. State-of-the-art al...


Efficient parallelization of the genetic algorithm solution of traveling salesman problem on multi-core and many-core systems

Efficient parallelization of genetic algorithms (GAs) on state-of-the-art multi-threading or many-threading platforms is a challenge due to the difficulty of scheduling hardware resources in the presence of concurrent threads. In this paper, to resolve the problem, a novel method is proposed, which parallelizes the GA by designing three concurrent kernels, each of which running some depe...


Ultra-Low-Energy DSP Processor Design for Many-Core Parallel Applications

Background and Objectives: Digital signal processors are widely used in energy constrained applications in which battery lifetime is a critical concern. Accordingly, designing ultra-low-energy processors is a major concern. In this work and in the first step, we propose a sub-threshold DSP processor. Methods: As our baseline architecture, we use a modified version of an existing ultra-low-power...


A Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs

Today, the challenge is to exploit the parallelism available in the way of multi-core architectures by the software. This could be done by re-writing the application, by exploiting the hardware capabilities or expect the compiler/software runtime tools to do the job for us. With the advent of multi-core architectures ([1] [2]), this problem is becoming more and more relevant. Even today, there ...


Parallelizing Compilation Scheme for Reduction of Power Consumption of Chip Multiprocessors

With the advance of semiconductor technology, chip multiprocessor architectures, or multi core processor architectures have attracted much attention to achieve low power consumption, high effective performance, good cost performance and short hardware/software development period. To this end, parallelizing compilers for chip multiprocessors are expected that allow us to parallelize program effe...



Journal:
  • CoRR

Volume: abs/1611.06172  Issue: –

Pages: –

Publication date: 2016